feat: add native Parquet file reading support via extension #71

Open
BingqingLyu wants to merge 16 commits into alibaba:main from BingqingLyu:read_parquet_extension

Conversation


@BingqingLyu BingqingLyu commented Mar 17, 2026

What do these changes do?

This PR adds native support for reading Apache Parquet format files in NeuG through a new Parquet extension. Users can now directly load graph data from Parquet files without requiring format conversion, integrating seamlessly with NeuG's existing data import mechanisms.

Related issue number

Fixes #56

Greptile Summary

This PR adds a native Apache Parquet file reading extension to NeuG, enabling users to directly LOAD FROM Parquet files without format conversion. The implementation follows NeuG's existing JSON extension pattern, introducing ArrowParquetOptionsBuilder (extending ArrowOptionsBuilder), a ParquetReadFunction with execFunc/sniffFunc, and an Init()/Name() entry point — all wired into the CMake build system with conditional Arrow Parquet/Snappy/Zlib support and CMake 4.x compatibility patches.

Key issues found:

  • use_embedded_schema option is a no-op: The option is declared in ParquetParseOptions, prominently documented in both parquet_options.h and load_parquet.md, but buildFragmentOptions() never calls parquetOpts.use_embedded_schema.get(options). Passing use_embedded_schema=false to LOAD FROM has absolutely no effect, which will mislead users.
  • TestOptionsTranslation_BufferedStream provides false test coverage: The test for the BUFFERED_STREAM=false path asserts only that reader_properties != nullptr, which is always true. The test body even acknowledges Arrow has no getter for the buffered-stream flag, yet the test is not removed or marked as a limitation, giving false confidence this path is verified.
  • cache_decompressed semantics are misleading: The option is documented as "Cache decompressed column chunks to accelerate repeated reads" but it actually configures Arrow's I/O hole-filling (read coalescing) cache via set_cache_options(). The mapping may also be logically inverted: cache_decompressed=true applies LazyDefaults() (lazy=true, on-demand fetching) rather than the more eager Defaults() (lazy=false).
  • CMake CMP0169 OLD defers migration: Setting CMP0169 OLD suppresses the CMake 4.x error for FetchContent_Populate() but does not migrate to the recommended FetchContent_MakeAvailable() pattern; this should be tracked as a follow-up.

Confidence Score: 3/5

  • The core extension is structurally sound and follows existing patterns, but a documented option (use_embedded_schema) is completely non-functional, which blocks merging without either fixing the implementation or removing the option from the docs and API.
  • The overall architecture is well-designed, mirrors the JSON extension correctly, and includes broad test coverage. However, the use_embedded_schema option being silently non-functional is a clear correctness gap that will confuse users. Combined with the misleading cache_decompressed semantics and a test that provides false coverage for BUFFERED_STREAM=false, the implementation has enough quality issues to warrant revisions before merging.
  • extension/parquet/src/parquet_options.cc requires the most attention — use_embedded_schema must be either implemented or removed, and the cache_decompressed/CacheOptions mapping should be clarified. extension/parquet/tests/parquet_test.cpp also needs the BufferedStream test corrected.

Important Files Changed

  • extension/parquet/src/parquet_options.cc: Core Parquet options builder; has a critical bug where the use_embedded_schema option is declared and documented but never read/applied in buildFragmentOptions(). Also contains a potential semantic inversion in the cache_decompressed logic mapping to Arrow's CacheOptions.
  • extension/parquet/include/parquet_options.h: Header defining ParquetParseOptions and ArrowParquetOptionsBuilder. Well-structured and follows the existing CSV builder pattern, but use_embedded_schema is declared here yet never used in the implementation.
  • extension/parquet/include/parquet_read_function.h: Implements ParquetReadFunction with execFunc and sniffFunc. The pattern mirrors the JSON extension correctly. sniffFunc has proper error handling; execFunc error handling depends on ArrowReader::read() always throwing on failure.
  • extension/parquet/src/parquet_extension.cpp: Minimal extension entry point registering ParquetReadFunction and extension metadata. Correctly mirrors the JSON extension pattern with proper exception handling.
  • extension/parquet/tests/parquet_test.cpp: Comprehensive C++ test suite covering options translation, type mapping, integration, and multi-file handling. The TestOptionsTranslation_BufferedStream test does not actually verify the disabled state and provides false assurance, as Arrow doesn't expose a getter for buffered-stream status.
  • cmake/BuildArrowAsThirdParty.cmake: Extended to conditionally enable Arrow Parquet and compression codec support. Includes a CMake 4.x compatibility patch for FetchContent_Populate() and Arrow's ThirdpartyToolchain.cmake. The CMP0169 OLD policy suppresses an error but defers the real migration to FetchContent_MakeAvailable().
  • tools/python_bind/tests/test_load.py: Adds well-structured Python integration tests for Parquet LOAD FROM, column projection, WHERE filtering, ORDER BY, and COPY FROM. Properly guarded by the NEUG_RUN_PARQUET_TESTS environment variable, following the JSON test pattern.
  • doc/source/extensions/load_parquet.md: User-facing documentation for the Parquet extension. Documents use_embedded_schema as a supported option, but this option currently has no effect in the implementation, which would mislead users.

Sequence Diagram

sequenceDiagram
    participant User as User (Cypher)
    participant Ext as ParquetExtension (Init/Name)
    participant RF as ParquetReadFunction
    participant FSP as LocalFileSystemProvider
    participant OB as ArrowParquetOptionsBuilder
    participant AR as ArrowReader
    participant Arrow as Apache Arrow Dataset API

    User->>Ext: LOAD PARQUET / INSTALL PARQUET
    Ext->>RF: registerFunction(ParquetReadFunction)

    User->>RF: LOAD FROM "file.parquet" RETURN *
    RF->>FSP: provide(fileSchema)
    FSP-->>RF: resolvedPaths + fileSystem

    alt Schema inference (sniffFunc)
        RF->>OB: ArrowParquetOptionsBuilder(state)
        RF->>AR: ArrowReader(state, optionsBuilder, fs)
        AR->>Arrow: inferSchema()
        Arrow-->>AR: Arrow Schema
        AR-->>RF: EntrySchema
    end

    RF->>OB: build()
    OB->>OB: buildFragmentOptions()<br/>(buffered_stream, pre_buffer,<br/>cache_decompressed, row_batch_size)
    OB->>OB: buildFileFormat(scanOptions)<br/>(ParquetFileFormat + default_fragment_scan_options)
    OB-->>RF: ArrowOptions(scanOptions, fileFormat)

    RF->>AR: read(localState, ctx)
    AR->>Arrow: createScanner(fs)
    Arrow-->>AR: Scanner
    AR->>Arrow: full_read / batch_read
    Arrow-->>AR: RecordBatches
    AR-->>RF: execution::Context
    RF-->>User: Result rows

Comments Outside Diff (1)

  1. extension/parquet/tests/parquet_test.cpp, line 1187-1210 (link)

    P1 TestOptionsTranslation_BufferedStream does not actually verify the disabled case

    The test sets BUFFERED_STREAM=false and then its only assertion is:

    EXPECT_NE(parquetFragmentOpts->reader_properties, nullptr)
        << "Extension should configure reader_properties based on BUFFERED_STREAM option";

    This assertion is always true regardless of whether buffered stream is enabled or disabled (the previous test suite already verifies reader_properties is non-null with default options). The comment in the test even acknowledges this: "Arrow doesn't expose a getter for is_buffered_stream_enabled".

    Because parquet::ReaderProperties does not expose a getter for the buffered-stream flag, the only meaningful way to test this branch is either:

    • Verify I/O timing/behavior in an integration test, or
    • Remove the test case since it cannot differentiate between enabled and disabled states.

    As written, this test gives false confidence that the BUFFERED_STREAM=false path is exercised correctly.

Last reviewed commit: 1fb1d56

Greptile also left 4 inline comments on this PR.


Comment on lines +74 to +115
ParquetParseOptions parquetOpts;
ReadOptions readOpts;

// Configure Parquet-specific reader properties
auto reader_properties = std::make_shared<parquet::ReaderProperties>();

// Enable buffered stream if configured
if (parquetOpts.buffered_stream.get(options)) {
  reader_properties->enable_buffered_stream();
}

// Set I/O buffer size in bytes
int64_t buffer_size = readOpts.batch_size.get(options);
reader_properties->set_buffer_size(buffer_size);

fragment_scan_options->reader_properties = reader_properties;

// Configure Arrow-specific reader properties
auto arrow_reader_properties = std::make_shared<parquet::ArrowReaderProperties>();

// Set Arrow row batch size (number of rows per batch)
int64_t row_batch_size = parquetOpts.row_batch_size.get(options);
arrow_reader_properties->set_batch_size(row_batch_size);

// Use threads setting from general read options
arrow_reader_properties->set_use_threads(readOpts.use_threads.get(options));

// Configure pre-buffering for high-latency filesystems
arrow_reader_properties->set_pre_buffer(parquetOpts.pre_buffer.get(options));

// Configure caching of decompressed data
if (parquetOpts.cache_decompressed.get(options)) {
  arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::LazyDefaults());
} else {
  arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::Defaults());
}

fragment_scan_options->arrow_reader_properties = arrow_reader_properties;

return fragment_scan_options;
Contributor
P1 use_embedded_schema option declared but never read

ParquetParseOptions defines use_embedded_schema (default: true) and it is documented in both parquet_options.h and load_parquet.md as a supported option, but buildFragmentOptions() never calls parquetOpts.use_embedded_schema.get(options). Every other option (buffered_stream, pre_buffer, cache_decompressed, row_batch_size) is consumed, but this one is completely ignored, so passing use_embedded_schema=false to LOAD FROM has no observable effect.

The option should either be wired into the Arrow ParquetFragmentScanOptions / ArrowReaderProperties (e.g., to decide whether to trust the embedded schema vs. infer from data), or removed from the struct, the header doc, and the user-facing documentation.

Collaborator Author

Fixed in ff02825. Removed the use_embedded_schema option entirely — it was declared but never read by buildFragmentOptions(). Arrow's ParquetFileFormat uses the embedded Parquet schema by default and provides no straightforward API to override this at the ParquetFragmentScanOptions level. The option has been removed from ParquetParseOptions, the header doc comment, and load_parquet.md.

Comment on lines +104 to +111
// Configure caching of decompressed data
if (parquetOpts.cache_decompressed.get(options)) {
  arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::LazyDefaults());
} else {
  arrow_reader_properties->set_cache_options(
      arrow::io::CacheOptions::Defaults());
}
Contributor
P2 cache_decompressed semantic mismatch with Arrow's CacheOptions

The option is documented as "Cache decompressed column chunks to accelerate repeated reads", but ArrowReaderProperties::set_cache_options() configures Arrow's I/O hole-filling cache (read coalescing), not decompressed column chunk caching.

Additionally, the mapping may be logically inverted: cache_decompressed=true applies CacheOptions::LazyDefaults() (lazy=true), which means the cache only loads bytes that are explicitly requested (i.e., does not pre-fetch). CacheOptions::Defaults() (lazy=false) fetches data more eagerly. If the intent is to "cache aggressively for repeated reads", the non-lazy (Defaults) option would be more appropriate when cache_decompressed=true.

Consider:

  1. Renaming the option to something that more accurately reflects what it controls (e.g., enable_io_coalescing or lazy_io_cache), and updating the documentation accordingly.
  2. Verifying whether lazy=true or lazy=false actually delivers the desired read-performance behaviour for the intended use-case.

Collaborator Author

Fixed in ff02825. Renamed cache_decompressed to enable_io_coalescing to accurately reflect what arrow::io::CacheOptions controls — this is Arrow's I/O read coalescing (hole-filling cache) for non-contiguous byte ranges, not decompressed column chunk caching. The option key is now ENABLE_IO_COALESCING, with updated comments in parquet_options.h, parquet_options.cc, and load_parquet.md. The lazy/eager semantics (LazyDefaults vs Defaults) are preserved as-is, since lazy coalescing (only load explicitly requested ranges) is the natural default when coalescing is enabled.

Comment on lines +48 to +68
static execution::Context execFunc(
    std::shared_ptr<reader::ReadSharedState> state) {
  // Get file system from provider
  LocalFileSystemProvider fsProvider;
  auto fileInfo = fsProvider.provide(state->schema.file);
  state->schema.file.paths = fileInfo.resolvedPaths;

  // Create Parquet-specific options builder
  auto optionsBuilder =
      std::make_unique<reader::ArrowParquetOptionsBuilder>(state);

  // Create Arrow reader with Parquet options
  auto reader = std::make_unique<reader::ArrowReader>(
      state, std::move(optionsBuilder), fileInfo.fileSystem);

  // Execute read operation
  execution::Context ctx;
  auto localState = std::make_shared<reader::ReadLocalState>();
  reader->read(localState, ctx);
  return ctx;
}
Contributor
P2 execFunc silently ignores errors from reader->read()

The read() call on line 66 populates ctx but does not return a status. If ArrowReader::read() internally encounters an Arrow error (e.g., corrupted file, schema mismatch), errors may only surface as logged warnings rather than propagated exceptions, leaving ctx in a partially populated state that the caller cannot distinguish from a successful read.

Comparing with the sniffFunc below (lines 97–101), which explicitly checks sniffResult and throws on failure, execFunc has no such guard. It is worth confirming that ArrowReader::read() always throws on failure (rather than logging and continuing) so that errors are not silently swallowed.

Collaborator Author

Confirmed in ff02825. Added an explicit comment in execFunc to document this. ArrowReader::read() propagates all errors via exceptions (THROW_IO_EXCEPTION / THROW_INVALID_ARGUMENT_EXCEPTION) — see src/utils/reader/reader.cc where every Arrow Result and Status failure path checks .ok() and throws. There is no silent swallowing path.

Comment on lines 117 to 121
# Point Arrow to use the project's RapidJSON
set(RapidJSON_ROOT "${CMAKE_SOURCE_DIR}/third_party/rapidjson" CACHE PATH "" FORCE)
endif()
set(ARROW_WITH_ZLIB OFF CACHE BOOL "" FORCE)
set(ARROW_ENABLE_THREADING ON CACHE BOOL "" FORCE)

Contributor
P2 cmake_policy(SET CMP0169 OLD) suppresses a CMake 4.x deprecation warning but defers the real fix

Setting CMP0169 OLD silences the deprecation of the legacy FetchContent_Populate() pattern, but this is marked as an error in CMake 4.x by default, meaning the workaround must stay in sync with future CMake releases. The FetchContent_MakeAvailable() API (available since CMake 3.14) is the recommended migration path.

This is acceptable as a short-term compatibility fix, but a follow-up to migrate away from FetchContent_Populate() should be tracked, especially since the project already handles CMake 4.x incompatibilities in other parts of this patch.
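For reference, a sketch of the recommended migration (illustrative only; the actual dependency names and URLs in Arrow's ThirdpartyToolchain.cmake will differ, and `somedep` is a placeholder):

```cmake
# Legacy pattern that CMP0169 OLD keeps alive:
#   FetchContent_GetProperties(somedep)
#   if(NOT somedep_POPULATED)
#     FetchContent_Populate(somedep)
#     add_subdirectory(${somedep_SOURCE_DIR} ${somedep_BINARY_DIR})
#   endif()
#
# CMake >= 3.14 replacement: declare once, then a single call that
# populates the content and adds it to the build.
include(FetchContent)
FetchContent_Declare(somedep
  URL https://example.com/somedep.tar.gz)  # placeholder URL
FetchContent_MakeAvailable(somedep)
```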

Collaborator Author

Acknowledged. The cmake_policy(SET CMP0169 OLD) is an intentional short-term compatibility shim to unblock CMake 4.x builds. Migrating the affected FetchContent_Populate() calls to FetchContent_MakeAvailable() is tracked as a follow-up — the current patch keeps the existing FetchContent call sites working without regressions while the proper migration is planned separately.

- Remove use_embedded_schema option (was declared but never read by
  buildFragmentOptions(); Arrow always uses the embedded Parquet schema
  by default and provides no simple API to override this)
- Rename cache_decompressed -> enable_io_coalescing to accurately reflect
  what arrow::io::CacheOptions controls (I/O hole-filling / read
  coalescing, not decompressed column chunk caching); update comments
  and docs accordingly
- Add comment in execFunc clarifying that ArrowReader::read() propagates
  all errors via exceptions, not silent logging
- Remove TestOptionsTranslation_BufferedStream test which could not
  meaningfully verify the disabled case (Arrow exposes no getter for
  is_buffered_stream_enabled); rename TestOptionsTranslation_CacheOptions
  -> TestOptionsTranslation_IoCoalescing and update option keys to match
  the renamed ENABLE_IO_COALESCING option
@BingqingLyu (Collaborator Author)

Addressed all review comments in ff02825:

  1. use_embedded_schema is a no-op (P1): Removed the option from ParquetParseOptions, the header doc comment, and load_parquet.md. Arrow's ParquetFileFormat uses the embedded Parquet schema by default and provides no straightforward ParquetFragmentScanOptions-level API to override this.

  2. cache_decompressed semantic mismatch (P2): Renamed to enable_io_coalescing (key: ENABLE_IO_COALESCING). The option controls Arrow's I/O hole-filling / read coalescing cache — not decompressed column chunk caching. Updated the comment in parquet_options.cc, the doc block in parquet_options.h, and the user-facing load_parquet.md.

  3. execFunc error propagation (P2): Verified that ArrowReader::read() throws on all error paths (THROW_IO_EXCEPTION / THROW_INVALID_ARGUMENT_EXCEPTION in reader.cc). Added an explicit comment in execFunc to make this clear.

  4. CMP0169 OLD CMake shim (P2): Acknowledged as an intentional short-term fix; the FetchContent_Populate() → FetchContent_MakeAvailable() migration is tracked as a follow-up. No code change needed.

  5. TestOptionsTranslation_BufferedStream false coverage: Removed the test — parquet::ReaderProperties exposes no getter for is_buffered_stream_enabled, so the only assertion (reader_properties != nullptr) was always true regardless of the option value. Renamed TestOptionsTranslation_CacheOptions → TestOptionsTranslation_IoCoalescing and updated it to use the new ENABLE_IO_COALESCING option key.

# ============================================================
extension_tests_default:
runs-on: [self-hosted]
container:
Collaborator

  • Should trigger neug-extension-test.yml at least for this PR.
  • We should first run neug-extension-test.yml with image: 0.1.0-nightly. If it works fine, retag it to 0.1.0 and push. cc @lnfjpt


# Configure Arrow Parquet support if parquet extension is enabled
if(BUILD_EXTENSIONS AND "parquet" IN_LIST BUILD_EXTENSIONS)
set(ARROW_PARQUET ON CACHE BOOL "" FORCE)
Collaborator

ARROW_PARQUET itself is a guard in Arrow's CMakeLists, so maybe we should use a different name here, e.g. ARROW_ENABLE_PARQUET.


Development

Successfully merging this pull request may close these issues.

[Feature] Add native support for Apache Parquet file format

2 participants